GEM-PRO - Genes & Sequences¶
This notebook gives an example of how to run the GEM-PRO pipeline with a dictionary of gene IDs and their protein sequences.
Imports¶
In [1]:
import sys
import logging
In [2]:
# Import the GEM-PRO class
from ssbio.pipeline.gempro import GEMPRO
In [3]:
# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
Logging¶
Set the logging level in logger.setLevel(logging.<LEVEL_HERE>)
to
specify how verbose you want the pipeline to be. Debug is most verbose.
CRITICAL
- Only really important messages shown
ERROR
- Major errors
WARNING
- Warnings that don’t affect running of the pipeline
INFO
(default)- Info such as the number of structures mapped per gene
DEBUG
- Really detailed information that will print out a lot of stuff
DEBUG
mode prints out a large amount of information,
especially if you have a lot of genes. This may stall your notebook!
In [4]:
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO) # SET YOUR LOGGING LEVEL HERE #
In [5]:
# Other logger stuff for Jupyter notebooks
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]
Initialization of the project¶
Set these three things:
ROOT_DIR
- The directory where a folder named after your
PROJECT
will be created
- The directory where a folder named after your
PROJECT
- Your project name
LIST_OF_GENES
- Your list of gene IDs
A directory will be created in ROOT_DIR
with your PROJECT
name.
The folders are organized like so:
ROOT_DIR
└── PROJECT
├── data # General storage for pipeline outputs
├── model # SBML and GEM-PRO models are stored here
├── genes # Per gene information
│ ├── <gene_id1> # Specific gene directory
│ │ └── protein
│ │ ├── sequences # Protein sequence files, alignments, etc.
│ │ └── structures # Protein structure files, calculations, etc.
│ └── <gene_id2>
│ └── protein
│ ├── sequences
│ └── structures
├── reactions # Per reaction information
│ └── <reaction_id1> # Specific reaction directory
│ └── complex
│ └── structures # Protein complex files
└── metabolites # Per metabolite information
└── <metabolite_id1> # Specific metabolite directory
└── chemical
└── structures # Metabolite 2D and 3D structure files
In [6]:
# SET FOLDERS AND DATA HERE
import tempfile
ROOT_DIR = tempfile.gettempdir()
PROJECT = 'genes_and_sequences_GP'
GENES_AND_SEQUENCES = {'b0870': 'MIDLRSDTVTRPSRAMLEAMMAAPVGDDVYGDDPTVNALQDYAAELSGKEAAIFLPTGTQANLVALLSHCERGEEYIVGQAAHNYLFEAGGAAVLGSIQPQPIDAAADGTLPLDKVAMKIKPDDIHFARTKLLSLENTHNGKVLPREYLKEAWEFTRERNLALHVDGARIFNAVVAYGCELKEITQYCDSFTICLSKGLGTPVGSLLVGNRDYIKRAIRWRKMTGGGMRQSGILAAAGIYALKNNVARLQEDHDNAAWMAEQLREAGADVMRQDTNMLFVRVGEENAAALGEYMKARNVLINASPIVRLVTHLDVSREQLAEVAAHWRAFLAR',
'b3041': 'MNQTLLSSFGTPFERVENALAALREGRGVMVLDDEDRENEGDMIFPAETMTVEQMALTIRHGSGIVCLCITEDRRKQLDLPMMVENNTSAYGTGFTVTIEAAEGVTTGVSAADRITTVRAAIADGAKPSDLNRPGHVFPLRAQAGGVLTRGGHTEATIDLMTLAGFKPAGVLCELTNDDGTMARAPECIEFANKHNMALVTIEDLVAYRQAHERKAS'}
PDB_FILE_TYPE = 'mmtf'
In [7]:
# Create the GEM-PRO project
my_gempro = GEMPRO(gem_name=PROJECT, root_dir=ROOT_DIR, genes_and_sequences=GENES_AND_SEQUENCES, pdb_file_type=PDB_FILE_TYPE)
[2018-02-05 18:11] [ssbio.pipeline.gempro] INFO: Creating GEM-PRO project directory in folder /tmp
[2018-02-05 18:11] [ssbio.pipeline.gempro] INFO: /tmp/genes_and_sequences_GP: GEM-PRO project location
[2018-02-05 18:11] [ssbio.pipeline.gempro] INFO: Loaded in 2 sequences
[2018-02-05 18:11] [ssbio.pipeline.gempro] INFO: 2: number of genes
Mapping sequence –> structure¶
Since the sequences have been provided, we just need to BLAST them to the PDB.
Methods¶
In [8]:
# Mapping using BLAST
my_gempro.blast_seqs_to_pdb(all_genes=True, seq_ident_cutoff=.9, evalue=0.00001)
my_gempro.df_pdb_blast.head(2)
[2018-02-05 18:11] [ssbio.pipeline.gempro] INFO: Completed sequence --> PDB BLAST. See the "df_pdb_blast" attribute for a summary dataframe.
[2018-02-05 18:11] [ssbio.pipeline.gempro] INFO: 2: number of genes with additional structures added from BLAST
Out[8]:
pdb_id | pdb_chain_id | hit_score | hit_evalue | hit_percent_similar | hit_percent_ident | hit_num_ident | hit_num_similar | |
---|---|---|---|---|---|---|---|---|
gene | ||||||||
b0870 | 3wlx | A | 1713.0 | 0.0 | 1.0 | 1.0 | 333 | 333 |
b0870 | 3wlx | B | 1713.0 | 0.0 | 1.0 | 1.0 | 333 | 333 |
Downloading and ranking structures¶
Methods¶
In [9]:
# Download all mapped PDBs and gather the metadata
my_gempro.pdb_downloader_and_metadata()
my_gempro.df_pdb_metadata.head(2)
[2018-02-05 18:11] [ssbio.pipeline.gempro] INFO: Updated PDB metadata dataframe. See the "df_pdb_metadata" attribute for a summary dataframe.
[2018-02-05 18:11] [ssbio.pipeline.gempro] INFO: Saved 11 structures total
Out[9]:
pdb_id | pdb_title | description | experimental_method | mapped_chains | resolution | chemicals | taxonomy_name | structure_file | |
---|---|---|---|---|---|---|---|---|---|
gene | |||||||||
b0870 | 3wlx | Crystal structure of low-specificity L-threoni... | Low specificity L-threonine aldolase (E.C.4.1.... | X-RAY DIFFRACTION | A;B | 2.51 | PLG | Escherichia coli | 3wlx.mmtf |
b0870 | 4lnj | Structure of Escherichia coli Threonine Aldola... | Low-specificity L-threonine aldolase (E.C.4.1.... | X-RAY DIFFRACTION | A;B | 2.10 | EPE;MG;PLR | Escherichia coli | 4lnj.mmtf |
In [10]:
# Set representative structures
my_gempro.set_representative_structure()
my_gempro.df_representative_structures.head()
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: 2/2: number of genes with a representative structure
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: See the "df_representative_structures" attribute for a summary dataframe.
Out[10]:
id | is_experimental | file_type | structure_file | |
---|---|---|---|---|
gene | ||||
b0870 | REP-3wlx | True | pdb | 3wlx-A_clean.pdb |
b3041 | REP-1iez | True | pdb | 1iez-A_clean.pdb |
In [11]:
# Looking at the information saved within a gene
my_gempro.genes.get_by_id('b0870').protein.representative_structure
my_gempro.genes.get_by_id('b0870').protein.representative_structure.get_dict()
Out[11]:
<StructProp REP-3wlx at 0x7f9a0ae345f8>
Out[11]:
{'_structure_dir': '/tmp/genes_and_sequences_GP/genes/b0870/b0870_protein/structures',
'chains': [<ChainProp A at 0x7f99fbd55710>],
'date': None,
'description': 'Low specificity L-threonine aldolase (E.C.4.1.2.48)',
'file_type': 'pdb',
'id': 'REP-3wlx',
'is_experimental': True,
'mapped_chains': ['A'],
'notes': {},
'original_structure_id': '3wlx',
'resolution': 2.51,
'structure_file': '3wlx-A_clean.pdb',
'taxonomy_name': 'Escherichia coli'}
Creating homology models¶
For those proteins with no representative structure, we can create
homology models for them. ssbio
contains some built in functions for
easily running
I-TASSER
locally or on machines with SLURM
(ie. on NERSC) or Torque
job
scheduling.
You can load in I-TASSER models once they complete using the
get_itasser_models
later.
Methods¶
In [12]:
# Prep I-TASSER model folders
my_gempro.prep_itasser_modeling('~/software/I-TASSER4.4', '~/software/ITLIB/', runtype='local', all_genes=False)
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: Prepared I-TASSER modeling folders for 0 genes in folder /tmp/genes_and_sequences_GP/data/homology_models
Saving your GEM-PRO¶
In [13]:
import os.path as op
my_gempro.save_json(op.join(my_gempro.model_dir, '{}.json'.format(my_gempro.id)), compression=False)
[2018-02-05 18:12] [root] WARNING: json-tricks: numpy scalar serialization is experimental and may work differently in future versions
[2018-02-05 18:12] [ssbio.io] INFO: Saved <class 'ssbio.pipeline.gempro.GEMPRO'> (id: genes_and_sequences_GP) to /tmp/genes_and_sequences_GP/model/genes_and_sequences_GP.json